Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the project's Docker infrastructure and dependency management to better support distinct SGLang and vLLM inference backends. By splitting the Dockerfile and introducing explicit Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Ignored Files
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request does a great job of splitting the Docker build into sglang and vllm variants, which simplifies dependency management. The use of uv's conflicts feature in pyproject.toml to make the backends mutually exclusive is an excellent implementation choice. The updates to documentation and CI workflows are thorough and consistent with these changes.
My main feedback is a medium-severity maintainability concern regarding the significant code duplication between Dockerfile.sglang and the new Dockerfile.vllm. I've left a specific comment with a suggestion to refactor this using a single, multi-stage Dockerfile to reduce future maintenance overhead.
c28cc00 to
03cd54d
Compare
Parametrize GRPO tests over both sglang and vllm inference backends so the vllm CI image can run GRPO tests (previously exit code 5 because all tests were marked sglang-only). Key changes: - Add vllm config YAMLs for fsdp, megatron, and archon backends - Replace global pytestmark=sglang with per-case sglang/vllm marks - Conditionally set config.vllm.model vs config.sglang.model_path - README table reformatted by mdformat hook Refs: #985
Parametrize GRPO tests over both sglang and vllm inference backends so the vllm CI image can run GRPO tests (previously exit code 5 because all tests were marked sglang-only). Key changes: - Add vllm config YAMLs for fsdp, megatron, and archon backends - Replace global pytestmark=sglang with per-case sglang/vllm marks - Conditionally set config.vllm.model vs config.sglang.model_path - README table reformatted by mdformat hook Refs: #985
Parametrize GRPO tests over both sglang and vllm inference backends so the vllm CI image can run GRPO tests (previously exit code 5 because all tests were marked sglang-only). Key changes: - Add vllm config YAMLs for fsdp, megatron, and archon backends - Replace global pytestmark=sglang with per-case sglang/vllm marks - Conditionally set config.vllm.model vs config.sglang.model_path - README table reformatted by mdformat hook Refs: #985
Restore test_ppo_stats.py and reward_curve.png which were accidentally included in the Docker split branch. These deletions are unrelated to the sglang/vllm image separation. Refs: #985
…ase state Revert user-facing docs, example configs, and install commands to match the current released image naming and extras. These will be updated in a follow-up when the split images are officially released. Key changes: - Restore README, AGENTS.md, CLAUDE.md install commands - Restore EN/ZH installation guides (Docker image names, uv sync) - Restore SkyPilot configs and example READMEs Refs: #985
Build separate sglang and vllm Docker images from a single parameterized Dockerfile. Each variant ships only one inference backend, eliminating dependency conflicts and reducing image size. Key changes: - Dockerfile with ARG BASE_IMAGE/VARIANT for per-variant builds - CI workflows: build-docker-image, test-areal, tag-release, install-test - pyproject.toml: sglang/vllm conflicting extras, cuda renamed to cuda-train - pytest markers and conditional fixtures for backend-specific tests - GRPO tests parametrized over both sglang and vllm inference backends - Docker validation tools auto-detect installed inference variant Refs: #985
373a0d3 to
8da6b47
Compare
|
The workflow succeeds at https://github.com/inclusionAI/AReaL/actions/runs/22798596790 Pending review. |
…onflicts Runner 2.317.0 does not support node24 required by actions/checkout@v6, causing 'Set up job' failures on dynamically provisioned GCP instances. Key changes: - Bump RUNNER_VERSION from 2.317.0 to 2.332.0 in test-areal.yml and build-docker-image.yml - Remove duplicate imports in test_rollout_controller.py from rebase with PR #996
* feat(infra): split Docker image into sglang and vllm variants Build separate sglang and vllm Docker images from a single parameterized Dockerfile. Each variant ships only one inference backend, eliminating dependency conflicts and reducing image size. Key changes: - Dockerfile with ARG BASE_IMAGE/VARIANT for per-variant builds - CI workflows: build-docker-image, test-areal, tag-release, install-test - pyproject.toml: sglang/vllm conflicting extras, cuda renamed to cuda-train - pytest markers and conditional fixtures for backend-specific tests - GRPO tests parametrized over both sglang and vllm inference backends - Docker validation tools auto-detect installed inference variant Refs: #985 * fix(infra): update GCP runner to v2.332.0 and resolve rebase import conflicts Runner 2.317.0 does not support node24 required by actions/checkout@v6, causing 'Set up job' failures on dynamically provisioned GCP instances. Key changes: - Bump RUNNER_VERSION from 2.317.0 to 2.332.0 in test-areal.yml and build-docker-image.yml - Remove duplicate imports in test_rollout_controller.py from rebase with PR #996
…project#985) * feat(infra): split Docker image into sglang and vllm variants Build separate sglang and vllm Docker images from a single parameterized Dockerfile. Each variant ships only one inference backend, eliminating dependency conflicts and reducing image size. Key changes: - Dockerfile with ARG BASE_IMAGE/VARIANT for per-variant builds - CI workflows: build-docker-image, test-areal, tag-release, install-test - pyproject.toml: sglang/vllm conflicting extras, cuda renamed to cuda-train - pytest markers and conditional fixtures for backend-specific tests - GRPO tests parametrized over both sglang and vllm inference backends - Docker validation tools auto-detect installed inference variant Refs: areal-project#985 * fix(infra): update GCP runner to v2.332.0 and resolve rebase import conflicts Runner 2.317.0 does not support node24 required by actions/checkout@v6, causing 'Set up job' failures on dynamically provisioned GCP instances. Key changes: - Bump RUNNER_VERSION from 2.317.0 to 2.332.0 in test-areal.yml and build-docker-image.yml - Remove duplicate imports in test_rollout_controller.py from rebase with PR areal-project#996
…project#985) * feat(infra): split Docker image into sglang and vllm variants Build separate sglang and vllm Docker images from a single parameterized Dockerfile. Each variant ships only one inference backend, eliminating dependency conflicts and reducing image size. Key changes: - Dockerfile with ARG BASE_IMAGE/VARIANT for per-variant builds - CI workflows: build-docker-image, test-areal, tag-release, install-test - pyproject.toml: sglang/vllm conflicting extras, cuda renamed to cuda-train - pytest markers and conditional fixtures for backend-specific tests - GRPO tests parametrized over both sglang and vllm inference backends - Docker validation tools auto-detect installed inference variant Refs: areal-project#985 * fix(infra): update GCP runner to v2.332.0 and resolve rebase import conflicts Runner 2.317.0 does not support node24 required by actions/checkout@v6, causing 'Set up job' failures on dynamically provisioned GCP instances. Key changes: - Bump RUNNER_VERSION from 2.317.0 to 2.332.0 in test-areal.yml and build-docker-image.yml - Remove duplicate imports in test_rollout_controller.py from rebase with PR areal-project#996
Summary
Split the monolithic Docker image build into sglang and vllm variants using a single parameterized
Dockerfilewith build arguments. Each variant ships only one inference backend, eliminating dependency conflicts and reducing image size.User-facing documentation (installation guides, example configs, READMEs) is intentionally not updated in this PR — those changes will land in a follow-up when the split images are officially released.
Docker
DockerfilewithARG BASE_IMAGE/ARG VARIANT— builds either variant from one filelmsysorg/sglang:v0.5.7-cu129-amd64-runtimevllm/vllm-openai:v0.14.0(withENTRYPOINT []reset)ghcr.io/inclusionai/areal-runtime:{tag}-{variant}(e.g.:dev-sglang,:v1.0.2-vllm)latesttag points to the sglang variantDependencies (pyproject.toml)
sglangandvllmas conflicting extras via[tool.uv.conflicts]cudaextra →cuda-train(training packages only, no inference backend)torchaofrom direct dependencies (transitive dep of sglang only)sglangandvllmpytest markers in[tool.pytest.ini_options]CI Workflows
build-docker-image.yml: Builds both images → tests each viatest-areal.yml→ promotes to:dev-{variant}test-areal.yml: Matrix strategy withvariantparameter; per-variant GCP runners; excludes opposite backend's tests via-m "not {backend}"tag-release-image.yml: Matrix build for release tags (v1.0.2-sglang,v1.0.2-vllm)install-test.yml: Docker test matrix for both variants.pre-commit-config.yaml: Excluded.github/workflows/fromcheck-yaml(GHA expressions use non-standard YAML)Tests
@pytest.mark.sglang/@pytest.mark.vllmmarkers-m "not {backend}"test_inference_engines.py: Conditional sglang fixture (skip when sglang not installed)test_tool_call_parser.py:pytest.importorskip("sglang")guardtest_grpo.py: Parametrized over(training_backend, inference_backend)with per-case marks — runs sglang configs on sglang image, vllm configs on vllm imageValidation Tools
validate_docker_installation.py: Auto-detect installed inference variant instead of requiring bothvalidation_base.py: Removedsglangfrom base critical packages (variant-specific)What is NOT included (deferred to release)
uv synccommands)Type of Change
Checklist